Abstract

Nonparametric rank tests for homogeneity and component independence
are proposed, which are based on data compressors. For homogeneity
testing the idea is to compress the binary string obtained by ordering
the two joint samples and writing 0 if the element is from the first
sample and 1 if it is from the second sample, breaking ties by
randomization (extension to the case of multiple samples is
straightforward). $H_0$ should be rejected if the string is compressed
(to a certain degree) and accepted otherwise. We show that such a test
obtained from an ideal data compressor is valid against all
alternatives.

Component independence is reduced to homogeneity testing by 
constructing two samples, one of which is the first half of the original 
and the other is the second half with one of the components randomly 
permuted. 

1 Introduction 

We consider two classical problems of mathematical statistics. The first one
is homogeneity testing: two (or more; see below) samples $X_1, \dots, X_n$ and
$Y_1, \dots, Y_n$ with elements in $\mathbb{R}$ are given. It is assumed that
the elements are drawn independently and that within each sample the
distribution is the same. We want to test the hypothesis $H_0$ that $X_i$ and
$Y_i$ are distributed according to the same distribution versus $H_1$ that the
distributions generating the samples are different. No assumptions whatsoever
are made on the distributions.

The second one is component independence: a sample $Z_1, \dots, Z_n$ is given,
generated i.i.d. according to some distribution $F_Z$. Each element $Z_i$
consists of two (or more) components $Z_i^1$ and $Z_i^2$. We wish to test
whether the components are independent of each other. That is, $H_0$ is that
the components are independent, whereas $H_1$ is that there is some dependency.
Again, no assumption is made on the distribution $F_Z$.

Both are well-known problems of nonparametric mathematical statistics. For
example, a classical test for homogeneity is the Kolmogorov-Smirnov test
(which assumes, however, that the distributions generating the samples are
continuous). There are many other nonparametric tests; some of them use the
ranks of elements within the joint sample instead of the actual samples. One
example is Wilcoxon's test, which also makes some additional assumptions on
the distribution; see [3] for an overview.

In this work we present simple nonparametric (distribution-free) rank
tests for homogeneity and component independence based on data
compressors.

The idea to use real-life data compressors for testing classical statistical
hypotheses, such as homogeneity, component independence and some others,
was suggested in [7, 8]. In these works statistical tests based on data
compressors are constructed which fall into the classical framework of
nonparametric mathematical statistics; in particular, the Type I error is fixed
while the Type II error goes to zero under a wide range of alternatives. The
hypotheses considered there mostly concern data samples drawn from discrete
(e.g. finite) spaces. Some tests for continuous spaces are also proposed, based
on partitioning. Here we extend this approach to rank tests, which allows
testing homogeneity and component independence without the need to partition
the sample spaces and make them finite. The idea of using data compressors
for tasks other than actual data compression was suggested in [1, 2, 4],
where data compressors are applied to such tasks as classification and
clustering. These works were largely inspired by Kolmogorov complexity, which
is also an important tool for the present work.

An "ideal" data compressor is one that compresses its input up to its
Kolmogorov complexity. This is intuitively obvious since, informally, the
Kolmogorov complexity of a string is the length of the shortest program that
outputs this string. Such data compressors do not exist; in particular,
Kolmogorov complexity itself is incomputable. Real data compressors, however,
can be considered as approximations of ideal ones.

In this work we provide a simple empirical procedure for testing homogeneity
and component independence with data compressors. We show that for an ideal
data compressor this procedure provides a statistical test which is valid
against all alternatives (the Type II error goes to zero), while the Type I
error is guaranteed to be below a pre-defined level (the so-called significance
level) for all data compressors, not only for ideal ones. It should also be
noted that the theoretical assumption underlying data compressors used in real
life is that the data to compress is stationary. Thus the tests designed in [8]
are provably valid against any stationary and ergodic alternative, even though
these tests are based on real data compressors, not only on ideal ones. In our
case, the alternative arising in the rank test under $H_1$ is not stationary.
Thus we prove theorems only about ideal data compressors, and real data
compressors can be used heuristically. However, it can be conjectured that the
same results can be proven for some particular real-life data compressors, for
example for those which are based on the measure $R$ from [6] or on the LZ
algorithm [11].

2 Homogeneity testing 

Homogeneity testing is the following task. Let there be given two samples
$X = \{X_1, \dots, X_m\}$ and $Y = \{Y_1, \dots, Y_k\}$ (the case of more than
two samples will also be considered). The $X_i$ are drawn independently
according to some probability distribution $F_X$ on $\mathbb{R}^d$
($d \in \mathbb{N}$), and the $Y_i$ are drawn independently of each other and
of the $X_i$ according to some distribution $F_Y$ on $\mathbb{R}^d$. The goal
is to test whether $F_X = F_Y$. No assumption is made on the distributions
$F_X$ and $F_Y$; we only assume that the $X_i$ and $Y_i$ are drawn
independently within the samples and jointly. So, we wish to test the
hypothesis $H_0 = \{(F_X, F_Y) : F_X = F_Y\}$ against
$H_1 = \{(F_X, F_Y) : F_X \ne F_Y\}$.

A code $\varphi$ is a function $\varphi : B^* \to B^*$ from the set of all
finite words over the binary alphabet $B = \{0, 1\}$ to itself, such that
$\varphi$ is an injection (that is, $a \ne b$ implies
$\varphi(a) \ne \varphi(b)$ for $a, b \in B^*$). A trivial example of a code
is the identity $\varphi_{id}(a) = a$. Less trivial examples that we have in
mind are data compressors, such as zip, rar, arj, or others, which take a word
and output a "compressed" version of it (which in fact is often longer than the
original) from which the original input can always be recovered. We will
construct (reasonable) tests for homogeneity from (good) data compressors.

First let us assume that $d = 1$ (that is, $X_i, Y_i \in \mathbb{R}$). Let
$Z_1 \le Z_2 \le \dots \le Z_{m+k}$ denote the joint sample constructed by
ordering jointly the two samples $X$ and $Y$. Construct the word
$A = A_1 \dots A_{m+k}$ as follows: for each $i$, $A_i = 0$ if $Z_i$ is taken
from the sample $X$ ($Z_i \in X$) and $A_i = 1$ if $Z_i$ is from the sample
$Y$ ($Z_i \in Y$), where ties are broken by randomization: if
$Z_j = Z_{j+1} = \dots = Z_{j'}$ and there are $m'$ elements of the sample $X$
which are equal to $Z_j$ and $k'$ elements of the sample $Y$ which are equal to
$Z_j$, then the word $A_j \dots A_{j'}$ is chosen randomly from all
$\frac{(m'+k')!}{m'!\,k'!}$ binary words which have $m'$ zeros and $k'$ ones,
assigning equal probabilities to all words.
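
The construction of the word $A$ for $d = 1$ can be sketched in a few lines of Python. This is only an illustration: the function name `rank_string` and the use of an auxiliary random sort key for tie-breaking are assumptions of the sketch, not part of the original text. Sorting tied elements by independent random keys puts them in a uniformly random order, so every arrangement of the tied labels is equally likely.

```python
import random

def rank_string(xs, ys, rng=None):
    """Return the word A as a str of '0'/'1'.

    '0' marks an element of the sample X, '1' an element of the sample Y.
    Ties between equal values are broken uniformly at random.
    """
    rng = rng or random.Random()
    # Each element carries (value, random key, label). Sorting by
    # (value, key) orders the joint sample and shuffles the labels
    # uniformly inside every group of equal values.
    tagged = [(x, rng.random(), "0") for x in xs]
    tagged += [(y, rng.random(), "1") for y in ys]
    tagged.sort(key=lambda t: (t[0], t[1]))
    return "".join(label for _, _, label in tagged)
```

For two well-separated samples the word consists of a block of zeros followed by a block of ones; for tied values only the numbers of zeros and ones are fixed.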

Now consider the case $d > 1$, that is, the elements of the samples $X$
and $Y$ are from $\mathbb{R}^d$, $d > 1$. Construct the samples
$\tilde X = \tilde X_1, \dots, \tilde X_m$ and
$\tilde Y = \tilde Y_1, \dots, \tilde Y_k$ as follows:
$\tilde X_t := x_t^{11}, x_t^{21}, \dots, x_t^{d1}, x_t^{12}, x_t^{22}, \dots, x_t^{d2}, \dots$
where $x_t^{ij}$ is the $j$th element in the binary expansion of the $i$th
component of $X_t$ (in case the expansion is ambiguous always take the one with
more zeros), and analogously for $Y$. Denote the described function which
converts $X$ to $\tilde X$ by $\tau$. Construct the string $A$ by applying the
(single-dimensional) procedure described above to the samples $\tilde X$ and
$\tilde Y$.
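
A minimal Python sketch of the digit-interleaving map described above follows. It rests on two simplifying assumptions that are not in the original text: every coordinate lies in $[0, 1)$, and only the first `bits` binary digits of each coordinate are used, whereas the construction in the text works with full expansions of arbitrary reals. The name `interleave` is hypothetical. The result is returned as an integer, so comparing integers is the same as comparing the truncated interleaved expansions.

```python
def interleave(point, bits=16):
    """Map a point of [0, 1)^d to an integer whose binary digits are the
    first digits of all coordinates, then the second digits of all
    coordinates, and so on, truncated to `bits` digits per coordinate."""
    # Fixed-point digits of each coordinate: floor(x * 2^bits). For a
    # dyadic rational this picks the terminating expansion, i.e. the one
    # with more zeros.
    scaled = [int(x * (1 << bits)) for x in point]
    key = 0
    for j in range(bits - 1, -1, -1):   # most significant digit first
        for s in scaled:                # coordinate 1, 2, ..., d
            key = (key << 1) | ((s >> j) & 1)
    return key
```

For example, with `bits=2` the point (0.5, 0.25) has coordinate digits 10 and 01, which interleave to 1001.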

Let $|K|$ denote the length of a string $K$.

Definition 1 (Homogeneity test $G_\varphi$). For any code $\varphi$ the test
for homogeneity $G_\varphi$ is constructed as follows. It rejects the
hypothesis $H_0$ (outputs reject) at the level of significance $\alpha$ if

$$|\varphi(A)| \le \log(\alpha N) \qquad (1)$$

where $N := \binom{m+k}{m}$ and $\log$ is base 2, and accepts $H_0$ (outputs
accept) otherwise.
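
As an illustration of Definition 1, here is a minimal Python sketch that uses `zlib` as the code. The choice of `zlib`, the function names, and the representation of the word $A$ as ASCII characters `'0'`/`'1'` are assumptions of this sketch, not prescriptions of the text. Since `zlib` is lossless, the map from $A$ to the bits of the compressed output is injective, so it is a code in the sense used here; its header and byte-level overhead only make the test more conservative.

```python
import math
import random
import zlib

def rank_string(xs, ys, rng):
    """Word A: '0' for elements of X, '1' for elements of Y, random ties."""
    tagged = [(x, rng.random(), "0") for x in xs]
    tagged += [(y, rng.random(), "1") for y in ys]
    tagged.sort(key=lambda t: (t[0], t[1]))
    return "".join(label for _, _, label in tagged)

def homogeneity_test(xs, ys, alpha=0.05, rng=None):
    """Return 'reject' if the code length of A is at most log2(alpha * N)."""
    rng = rng or random.Random()
    a = rank_string(xs, ys, rng)
    # Length of the codeword in bits (zlib plays the role of the code).
    code_bits = 8 * len(zlib.compress(a.encode("ascii"), 9))
    # N = number of binary words with len(xs) zeros and len(ys) ones.
    n_words = math.comb(len(xs) + len(ys), len(xs))
    threshold = math.log2(alpha) + math.log2(n_words)
    return "reject" if code_bits <= threshold else "accept"
```

With two samples of size 500, completely separated samples give a word of 500 zeros followed by 500 ones, which `zlib` compresses far below the threshold of roughly 990 bits, whereas two samples from the same continuous distribution give a word that a general-purpose compressor cannot bring under that threshold except with negligible probability.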

Definition 2 (More than two samples). In case we are given $r$ samples,
where $r > 2$, and wish to test $H_0$ that they are all generated according to
the same distribution versus $H_1$ that at least two of the distributions are
different, the test is the same, except that the string $A$ is not binary but
over an $r$-element alphabet, and in the test above instead of $N$ we take

$$N' := \frac{\left(\sum_{i=1}^r m_i\right)!}{\prod_{i=1}^r m_i!}$$

where $m_i$ are the sizes of the samples.
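
The count used for $r$ samples is a multinomial coefficient, and the corresponding rejection threshold can be computed exactly with integer arithmetic. The sketch below is only illustrative; the function names are hypothetical.

```python
import math

def log2_multinomial(sizes):
    """log2 of (m_1 + ... + m_r)! / (m_1! * ... * m_r!), the number of
    words over an r-letter alphabet with m_i occurrences of letter i."""
    total = math.factorial(sum(sizes))
    for m in sizes:
        total //= math.factorial(m)   # every partial quotient is an integer
    return math.log2(total)

def rejection_threshold(sizes, alpha):
    """Reject when the code length in bits is at most this value."""
    return math.log2(alpha) + log2_multinomial(sizes)
```

For two samples this reduces to the binomial coefficient of Definition 1.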

The intuition is as follows. Observe that if the distributions $F_X$ and $F_Y$
are equal (that is, $H_0$ is true), then the string $A$ is just a random binary
string with $m$ zeros and $k$ ones; all such strings have equal probabilities
under $H_0$. Thus a good data compressor should be able to compress it to about
$\log N$ bits, but no code can compress many such strings to less than
$\log N - t$ bits ($t > 0$), since there are $N$ such strings and only
$2^{-t} N$ binary strings of length $\log N - t$.

Proposition 1 (Type I error). Let $d = 1$. For any code $\varphi$ and any
$\alpha \in (0, 1]$ the Type I error of the test $G_\varphi$ with level of
significance $\alpha$ is not greater than $\alpha$:

$$P\{X, Y : G_\varphi(X, Y) = \text{reject}\} \le \alpha \qquad (2)$$

for all $P = (F_X, F_Y) \in H_0$.

Remark 1. The proposition still holds if $H_0$ is rejected when

$$|\varphi(A)| \le (k + m)\, h\!\left(\frac{k}{k + m}\right) + \log \alpha - \log(k + m), \qquad (3)$$

where $h(t)$ is the entropy

$$h(t) := -t \log t - (1 - t) \log(1 - t). \qquad (4)$$

In case of $r$ samples (3) takes the form

$$|\varphi(A)| \le n h \log r + \log \alpha - \log n \qquad (5)$$

with $n = \sum_{i=1}^r m_i$ and $h = -\sum_{i=1}^r \frac{m_i}{n} \log \frac{m_i}{n}$.

Proof. As was noted, under $H_0$, for every string $a \in B^{k+m}$ that
consists of $m$ zeros and $k$ ones, $P(A = a) = 1/N$ (that is, all such strings
are equiprobable). Since there are only $\alpha N$ binary strings of length
$\log \alpha N$ and $\varphi$ is an injective function, that is, each codeword
is assigned to at most one word, we get
$P\{X, Y : |\varphi(A)| \le \log \alpha N\} \le \frac{1}{N} N \alpha = \alpha$,
which together with the definition of $G_\varphi$ implies (2).

The statement of the Remark can be derived from Stirling's expansion
for $N$ and $N'$. □

Remark 2. The term $-\log(k + m)$ in (3) is due to the fact that there are
only $\binom{k+m}{m}$ strings with $m$ zeros and $k$ ones (among all $2^{k+m}$
binary strings of this length). So the code $\varphi$ can specifically assign
shorter codewords to these strings. As real data compressors are not designed
to favour strings of this particular ratio of zeros and ones, in practice it is
recommended to omit the term $-\log(k + m)$ in (3). The same concerns the term
$-\log n$ in (5).
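
To see how the entropy form (3) relates to the exact threshold $\log \alpha N$ of Definition 1, the two can be compared numerically. The Python sketch below is only an illustration with hypothetical function names; the flag `keep_log_term` switches off the term $-\log(k + m)$, as Remark 2 recommends for practical compressors.

```python
import math

def entropy(t):
    """Binary entropy h(t) in bits, with h(0) = h(1) = 0."""
    if t in (0.0, 1.0):
        return 0.0
    return -t * math.log2(t) - (1 - t) * math.log2(1 - t)

def exact_threshold(m, k, alpha):
    """log2(alpha * N) with N = binomial(m + k, m), as in Definition 1."""
    return math.log2(alpha) + math.log2(math.comb(m + k, m))

def entropy_threshold(m, k, alpha, keep_log_term=True):
    """Right-hand side of (3); optionally without the term -log(k + m)."""
    n = m + k
    value = n * entropy(k / n) + math.log2(alpha)
    if keep_log_term:
        value -= math.log2(n)
    return value
```

For $m = k = 500$ and $\alpha = 0.05$ the exact threshold is about 990 bits; (3) with the logarithmic term is a few bits lower, and the variant without it is a few bits higher, because $\binom{k+m}{m} \le 2^{(k+m) h(k/(k+m))}$.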

Obviously, for some codes the test is useless (for example if $\varphi$ is the
identity mapping), and Proposition 1 is only useful when the Type II error
goes to zero. Next we will define "ideal" codes (the codes that compress a
word up to its Kolmogorov complexity) and show that for them the probability
of accept indeed goes to zero under any distribution in $H_1$.

Informally, the Kolmogorov complexity of a string $A$ is the length of the
shortest program that outputs $A$ (on the empty input). Clearly, the best,
"ideal", data compressor can compress any string $A$ up to its Kolmogorov
complexity, and not more (except possibly for a constant). Next we present a
definition of Kolmogorov complexity; for fine details see [U [5]. The
complexity of a string $A \in B^*$ with respect to a Turing machine $\zeta$ is
defined as

$$C_\zeta(A) = \min_p \{ l(p) : \zeta(p) = A \}$$

where $p$ ranges over all binary strings (interpreted as programs for $\zeta$),
$l(p)$ is the length of $p$, and the minimum over the empty set is defined as
$\infty$. There exists a Turing machine $\zeta$ such that
$C_\zeta(A) \le C_{\zeta'}(A) + c_{\zeta'}$ for any $A$ and any Turing machine
$\zeta'$ (the constant $c_{\zeta'}$ depends on $\zeta'$ but not on $A$). Fix
any such $\zeta$ and define the Kolmogorov complexity of a string
$A \in \{0, 1\}^*$ as $C(A) := C_\zeta(A)$. Clearly, $C(A) \le |A| + b$ for any
$A$ and for some $b$ depending only on $\zeta$.

Definition 3 (ideal codes). Call a code $\varphi$ ideal if for some constant
$c$ the inequality $|\varphi(A)| \le C(A) + c$ holds for any binary string $A$.

Clearly such codes exist. 

Proposition 2 (Type II error: universal validity). For any ideal code
$\varphi$ the Type II error of the test $G_\varphi$ with any fixed significance
level $\alpha > 0$ goes to zero:
$P\{X, Y : G_\varphi(X, Y) = \text{accept}\} \to 0$ for any $P$ in $H_1$ if
$k, m \to \infty$ in such a way that $0 < a < \frac{k}{k+m} < b < 1$ for some
$a, b$.

Proof. First observe that the function $\tau$ that converts $d$-dimensional
samples $X$ and $Y$ to single-dimensional samples $\tilde X$ and $\tilde Y$ has
the following property: if $X$ and $Y$ are distributed according to different
distributions then $\tilde X$ and $\tilde Y$ are also distributed according to
different distributions. Indeed, $\tau$ is one to one, and transforms cylinder
sets (sets of the form
$\{x \in \mathbb{R}^d : x^{i_1 j_1} = b_1, \dots, x^{i_t j_t} = b_t;\ b_l \in \{0, 1\},\ t, i_l, j_l \in \mathbb{N}\ (1 \le l \le t)\}$)
to cylinder sets. So together with $F_X$ ($F_Y$) it defines some distribution
$F_{\tilde X}$ ($F_{\tilde Y}$) on $\mathbb{R}$. If the distributions $F_X$ and
$F_Y$ are different then they are different on some cylinder set $T$, but then
$F_{\tilde X}(\tau(T)) \ne F_{\tilde Y}(\tau(T))$. Thus further in the proof we
will assume that $d = 1$.

We have to show that the Kolmogorov complexity $C(A)$ of the string $A$, and
hence (up to an additive constant) the codelength $|\varphi(A)|$, is less than
$\log \alpha N \ge (k + m)\, h\!\left(\frac{k}{k+m}\right) + \log \alpha - \log(k + m)$
for any fixed $\alpha$ from some $k, m$ on. To show this, we have to find a
sufficiently short description $s(A)$ of the string $A$; then $|\varphi(A)|$ is
not greater than $|s(A)| + c$ where $c$ is a constant.

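If $H_1$ is true then $F_X \ne F_Y$ and so there exist some interval
$T = (-\infty, t]$ and some $\delta > 0$ such that
$|F_X(T) - F_Y(T)| > 2\delta$. Then we will have

$$\left| \frac{\#\{x \in X \cap T\}}{m} - \frac{\#\{y \in Y \cap T\}}{k} \right| > \delta \qquad (6)$$

from some $k, m$ on with probability 1.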

Let $A'$ be the starting part of $A$ that consists of all elements that belong
to $T$ and let $m' := \#\{x \in X \cap T\}$ and $k' := \#\{y \in Y \cap T\}$. A
description of $A'$ can be constructed as the index of $A'$ in the set
(ordered, say, lexicographically) of all binary strings of length $m' + k'$
that have exactly $m'$ zeros and $k'$ ones, plus the description of $m'$ and
$k'$. Thus the length of such a description is bounded by
$\log \frac{(m'+k')!}{m'!\,k'!} \le (m' + k')\, h\!\left(\frac{k'}{m'+k'}\right)$
plus $\log k' + \log m' + \text{const}$ (the inequality follows from
$n! \le n^n$ for all $n$). Let $\bar A$ denote the remaining part of $A$ (that
is, what goes after $A'$). The length of the description of $\bar A$ is bounded
by
$(\bar m + \bar k)\, h\!\left(\frac{\bar k}{\bar m + \bar k}\right) + \log \bar k + \log \bar m + \text{const}$,
where $\bar m = m - m'$ and $\bar k = k - k'$. Since $h$ is concave and
$\frac{k}{m+k}$ is between $\frac{k'}{m'+k'}$ and
$\frac{\bar k}{\bar m + \bar k}$, from Jensen's inequality we obtain

$$h\!\left(\frac{k}{m+k}\right) - \left( \frac{m'+k'}{m+k}\, h\!\left(\frac{k'}{m'+k'}\right) + \frac{\bar m + \bar k}{m+k}\, h\!\left(\frac{\bar k}{\bar m + \bar k}\right) \right) > 0.$$

Denote this difference by $\gamma(k, m, k', m')$. Let
$\gamma = \inf \gamma(k, m, k', m')$ where the infimum is taken over all pairs
$k, m$ that satisfy the condition of the proposition,
$0 < a < \frac{k}{k+m} < b < 1$, and all $k', m'$ that satisfy (6). It follows
that $\inf \left| \frac{k}{m+k} - \frac{k'}{m'+k'} \right| > 0$ and
$\inf \left| \frac{k}{m+k} - \frac{k-k'}{(m-m')+(k-k')} \right| > 0$. Thus,
$\gamma$ is positive and depends only on $a$, $b$ and $\delta$. To uniquely
describe $A$ we need the descriptions of $A'$ and of the remaining part
$\bar A$, and also $k$ and $m$; these have to be encoded in a self-delimiting
way. The length of such a description $s(A)$ is bounded by the lengths of the
descriptions of $A'$ and $\bar A$ plus $\log(k + m)$ and some constant. Thus



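$$(k + m)\, h\!\left(\frac{k}{k+m}\right) + \log \alpha - \log(k + m) - |\varphi(A)|$$

$$\ge (k + m)\, h\!\left(\frac{k}{k+m}\right) + \log \alpha - 2 \log(k + m) - (k + m) \left( \frac{m'+k'}{m+k}\, h\!\left(\frac{k'}{m'+k'}\right) + \frac{\bar m + \bar k}{m+k}\, h\!\left(\frac{\bar k}{\bar m + \bar k}\right) \right) - c'$$

$$\ge (k + m)\gamma - 2 \log(k + m) - c$$

for some constants $c'$ and $c$, with $\bar m = m - m'$ and $\bar k = k - k'$
as above; clearly, this expression is greater than 0 from some $k, m$ on. □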

So, as a corollary of Propositions 1 and 2 we get the following statement.

Theorem 1. For any code $\varphi$ and any $\alpha \in (0, 1]$ the Type I error
of the test $G_\varphi$ with level of significance $\alpha$ is not greater than
$\alpha$. If, in addition, the code $\varphi$ is ideal then the Type II error
of $G_\varphi$ tends to 0 as the sample sizes approach infinity (under the
condition of Proposition 2).



3 Component independence testing 

Component independence testing is the following task. A sample
$Z = Z_1, \dots, Z_n$ is given where each $Z_i$ consists of $r$ components
$Z_i^1, Z_i^2, \dots, Z_i^r$, $Z_i^j \in \mathbb{R}^{d_j}$. The sample is
generated according to some probability distribution $F_Z$ on $\mathbb{R}^d$,
where $d := \sum_{j=1}^r d_j$. The goal is to test whether the components are
distributed independently. That is, $H_0$ is that

$$F_Z(Z_i^1 \in T_1, \dots, Z_i^r \in T_r) = \prod_{j=1}^r F_Z(Z_i^j \in T_j) \qquad (7)$$

for all measurable $T_j \subset \mathbb{R}^{d_j}$, $1 \le j \le r$. $H_1$ is
the negation of $H_0$ (the equality (7) is false for some selection of the sets
$T_j$, $1 \le j \le r$). Again, no assumption is made on the form of the
distribution $F_Z$.

Fix any code $\varphi$ and construct the test for component independence
$I_\varphi$ as follows. Assume that $n = 2m$ for some $m$ and define the
samples $X$ and $Y$ as the first and the second half of the sample $Z$:
$X_1 = Z_1, \dots, X_m = Z_m$ and $Y_1 = Z_{m+1}, \dots, Y_m = Z_{2m}$ (if $n$
is odd then make the samples $X$ and $Y$ of sizes $\lfloor n/2 \rfloor$ and
$n - \lfloor n/2 \rfloor$). Construct the sample $Y'$ from $Y$ by permuting the
components independently: $Y'^{\,j}_i = Y^j_{\pi_j(i)}$, $1 \le i \le m$,
$1 \le j \le r$, where $\pi_j$ are permutations of $1, \dots, m$, selected at
random (with equal probabilities) independently of each other.

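Definition 4 (Component independence test $I_\varphi$). The test $I_\varphi$
(with level of significance $\alpha$) consists in applying the test for
homogeneity $G_\varphi$ to the samples $X$ and $Y'$, where $Y'$ is the second
half of the sample with randomly permuted components (with level of
significance $\alpha$).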
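
The reduction of Definition 4 can be sketched end to end in Python. The sketch makes several assumptions that are not in the original text: each component is a single number in $[0, 1)$, binary expansions are truncated to `bits` digits, and `zlib` stands in for the code. All function names are hypothetical. Whether a practical compressor detects a given form of dependence is an empirical question; the Type II guarantee in the text is stated for ideal codes only.

```python
import math
import random
import zlib

def interleave(point, bits=16):
    """Interleave the first `bits` binary digits of coordinates in [0, 1)."""
    scaled = [int(x * (1 << bits)) for x in point]
    key = 0
    for j in range(bits - 1, -1, -1):
        for s in scaled:
            key = (key << 1) | ((s >> j) & 1)
    return key

def permute_components(ys, rng):
    """Apply an independent uniformly random permutation to each component."""
    columns = [list(col) for col in zip(*ys)]
    for col in columns:
        rng.shuffle(col)
    return list(zip(*columns))

def independence_test(zs, alpha=0.05, bits=16, rng=None):
    """Split the sample, permute the second half, then test homogeneity."""
    rng = rng or random.Random()
    half = len(zs) // 2
    xs, ys = zs[:half], permute_components(zs[half:], rng)
    # Rank word of the joint sample, ordered by interleaved digits,
    # with ties broken by an independent random key.
    tagged = [(interleave(p, bits), rng.random(), "0") for p in xs]
    tagged += [(interleave(p, bits), rng.random(), "1") for p in ys]
    tagged.sort(key=lambda t: (t[0], t[1]))
    a = "".join(label for _, _, label in tagged)
    code_bits = 8 * len(zlib.compress(a.encode("ascii"), 9))
    threshold = math.log2(alpha) + math.log2(math.comb(len(zs), half))
    return "reject" if code_bits <= threshold else "accept"
```

Permuting each component separately keeps every marginal sample intact while destroying any dependence between the components of the second half, which is what makes the comparison with the first half meaningful.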

Indeed, it is easy to check that $H_0$ is true if and only if $X$ and the
permuted sample $Y'$ are distributed according to the same distribution. So we
get the following statement.

Theorem 2. For any code $\varphi$ and any $\alpha \in (0, 1]$ the Type I error
of the test $I_\varphi$ with level of significance $\alpha$ is not greater than
$\alpha$. If, in addition, the code $\varphi$ is ideal then the Type II error
of $I_\varphi$ tends to 0 as the sample size $n$ approaches infinity.